Mon. Not. R. Astron. Soc. 000, 000-000 (2007) Printed 1 February 2008 (MN KTeX style file v2.2) 



Multimodal nested sampling: an efficient and robust alternative to 
MCMC methods for astronomical data analysis 

Farhan Feroz* and M.P. Hobson 

Astrophysics Group, Cavendish Laboratory. JJ Thomson Avenue, Cambridge CBS OHE, UK 



Accepted — . Received — ; in original form 1 February 2008 



ABSTRACT 

In performing a Bayesian analysis of astronomical data, two difficult problems often emerge. 
First, in estimating the parameters of some model for the data, the resulting posterior distri- 
bution may be multimodal or exhibit pronounced (curving) degeneracies, which can cause 
problems for traditional Markov Chain Monte Carlo (MCMC) sampling methods. Second, 
in selecting between a set of competing models, calculation of the Bayesian evidence for 
each model is computationally expensive using existing methods such as thermodynamic in- 
tegration. The nested sampling method introduced by SkillinJ ( |2004|) . has greatly reduced 
the computational expense of calculating evidences and also produces posterior inferences 
as a by-product. This m ethod has been applied successfully in cosmological applications by 
iMukheriee et alj (l2006h . but their imple mentation was effic ient only for unimodal distribu- 
tions without pronounced degeneracies. IShaw et al] (1200 7') recently introduced a clustered 
nested sampling method which is significantly more efficient in sampling from multimodal 
posteriors and also determines the expectation and variance of the final evidence from a single 
run of the algorithm, hence providing a further increase in efficiency. In this paper, we build on 
the work of Shaw et al. and present three new methods for sampling and evidence evaluation 
from distributions that may contain multiple modes and significant degeneracies in very high 
dimensions; we also present an even more efficient technique for estimating the uncertainty on 
the evaluated evidence. These methods lead to a further substantial improvement in sampling 
efficiency and robustness, and are applied to two toy problems to demonstrate the accuracy 
and economy of the evidence calculation and parameter estimation. Finally, we discuss the 
use of these methods in performing Bayesian object detection in astronomical datasets, and 
show that they significantly outperform existing MCMC techniques. An implementation of 
our methods will be publicly released shortly. 
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1 INTRODUCTION problematic. Bayesian model selection has been hindered by the 

computational expense involved in the calculation to sufficient pre- 



Bayesian analysis methods are now widely used in astrophysics 
and cosmology, and it is thus important to develop methods for 
performing such analyses in an efficient and robust manner. In 
general, Bayesian inference divides into two categories: parame- 
ter estimation and model selection. Bayesian parameter estimation 



cision of the key ingredient, the Bayesian evidence (also called 

the marginalised likelihood or the marginal density of the data). 
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tion is clearly a poor one for multimodal posteriors (except perhaps 
if one performs a separate Gaussian approximation at each mode). 
The Savage-Dickey density ratio has also been proposed (Trotta 
2005) as an exact, and potentially faster, means of evaluating ev- 
idences, but is restricted to the special case of nested hypotheses 
and a separable prior on the model parameters. Various alternative 
in formation crite ria for astrophysical model selection are discussed 
bv lLiddldlioOTh . but the evidence remains the preferred method. 

The nested sampling approach (Skilling 2004) is a Monte 
Carlo method targetted at the efficient calculation of the evidence, 
but also produces p osterior inferences as a by-product. In cosmo- 
logical applications. lMukheriee et ^ j2006l) show that their imple- 
mentation of the method requires a factor of ~ 100 fewer pos- 
terior evaluations than thermodynamic integration. To achieve an 
improved acceptance ratio and efficiency, their algorithm uses an 
elliptical bound containing the current point set at each stage of the 
process to restrict the re gion around the po sterior peak from which 
new samples are drawn. Shaw et aU j2007l) point out, however, that 
this method becomes highly inefficient for multimodal posteriors, 
and hence introduce the notion of clustered nested sampling, in 
which multiple peaks in the posterior are detected and isolated, and 
separate ellipsoidal bounds are constructed around each mode. This 
approach significantly increases the sampling efficiency. The over- 
all computational load is reduced still further by the use of an im- 
proved error calculation (Skilling 2004) on the final evidence result 
that produces a mean and standard error in one sampling, eliminat- 
ing the need for multiple runs. 

In this paper, we build on the work of IShaw et alJ i2007h . by 
pursuing further the notion of detecting and characterising multiple 
modes in the posterior from the distribution of nested samples. In 
particular, within the nested sampling paradigm, we suggest three 
new algorithms (the first two based on sampling from ellipsoidal 
bounds and the third on the Metropolis algorithm) for calculating 
the evidence from a multimodal posterior with high accuracy and 
efficiency even when the number of modes is unknown, and for 
producing reliable posterior inferences in this case. The first algo- 
rithm samples from all the modes simultaneously and provides an 
efficient way of calculating the 'global' evidence, while the second 
and third algorithms retain the notion from Shaw et al. of iden- 
tifying each of the posterior modes and then sampling from each 
separately. As a result, these algorithms can also calculate the 'lo- 
cal' evidence associated with each mode as well as the global evi- 
dence. All the algorithms presented differ from that of Shaw et al. 
in several key ways. Most notably, the identification of posterior 
modes is performed using the X-means clustering algorithm (Pel- 
leg et al. 2000), rather than fc-means clustering with A: = 2; we find 
this leads to a substantial improvement in sampling efficiency and 
robustness for highly multimodal posteriors. Further innovations 
include a new method for fast identification of overlapping ellip- 
soidal bounds, and a scheme for sampling consistently from any 
such overlap region. A simple modification of our methods also 
enables efficient sampling from posteriors that possess pronounced 
degeneracies between parameters. Finally, we also present a yet 
more efficient method for estimating the uncertainty in the calcu- 
lated (local) evidence value(s) from a single run of the algorithm. 
The above innovations mean our new methods constitute a viable, 
general replacement for traditional MCMC sampling techniques in 
astronomical data analysis. 

The outline of the paper is as follows. In section 2, we briefly 
review the basic aspects of Bayesian inference for parameter esti- 
mation and model selection. In section 3 we introduce nested sam- 
pling and discuss the ellipsoidal nested sampling technique in sec- 



tion 4. We present two new algorithms based on ellipsoidal sam- 
pling and compare them with previous methods in section 5, and in 
Section 6 we present a new method based on the Metropolis algo- 
rithm. In section 7, we apply our new algorithms to two toy prob- 
lems to demonstrate the accuracy and efficiency of the evidence 
calculation and parameter estimation as compared with other tech- 
niques. In section 8, we consider the use of our new algorithms in 
Bayesian object detection. Finally, our conclusions are presented in 
Section 9. 



2 BAYESIAN INFERENCE 

Bayesian inference methods provide a consistent approach to the 
estimation of a set parameters in a model (or hypothesis) H for 
the data D. Bayes' theorem states that 



Pr(0!D,//) 



Pr(DjQ,H)Pr(Q|ij') 
Y>r[T)\H) ' 



(1) 



where Pr(0|D,//) = P{@) is the posterior probability distri- 
bution of the parameters, Pr(D|©, H) = 1/(0) is the likelihood, 
Pr(©|J/) = 7r(0)is the prior, and Pr(D|Ji") = 2 is the Bayesian 
evidence. 

In parameter estimation, the normalising evidence factor is 
usually ignored, since it is independent of the parameters 0, and 
inferences are obtained by taking samples from the (unnormalised) 
posterior using standard MCMC sampling methods, where at equi- 
librium the chain contains a set of samples from the parameter 
space distributed according to the posterior. This posterior consti- 
tutes the complete Bayesian inference of the parameter values, and 
can be marginalised over each parameter to obtain individual pa- 
rameter constraints. 

In contrast to parameter estimation problems, in model selec- 
tion the evidence takes the central role and is simply the factor re- 
quired to normalize the posterior over 0: 



Z = j L(0)7r(0)d°0, 



(2) 



where D is the dimensionality of the parameter space. As the av- 
erage of the likelihood over the prior, the evidence is larger for 
a model if more of its parameter space is likely and smaller for a 
model with large areas in its parameter space having low likelihood 
values, even if the likelihood function is very highly peaked. Thus, 
the evidence automatically implements Occam's razor: a simpler 
theory with compact parameter space will have a larger evidence 
than a more complicated one, unless the latter is significantly bet- 
ter at explaining the data. The question of model selection between 
two models Hq and H\ can then be decided by comparing their 
respective posterior probabilities given the observed data set D, as 
follows 

Pr(f/'i|D) _ Pr(D|_f/'i)Pr(i?i) _ Zx Pr(ffi) 



Pr(ffolD) Vy{T>\Ho)Vx{Ho) ZoPr(Ho 



(3) 



where Pt{Hi)/ Ft{Ho) is the a priori probability ratio for the two 
models, which can often be set to unity but occasionally requires 
further consideration. 

Unfortunately, evaluation of the multidimensional integral ^ 
is a challenging numerical task. The standard technique is ther- 
modynamic integration, which uses a modified form of MCMC 
sampling. The dependence of the evidence on the prior requires 
that the prior space is adequately sampled, even in regions of low 
likelihood. To achieve this, the thermodynamic integration tech- 
nique draws MCMC samples not from the posterior directly but 
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Figure 1. Proper thermodynamic integration requires the log-likelihood to 
be concave like (a), not (b). 



from L TV where A is an inverse temperature that is raised from 
~ to 1. For low values of A, peaks in the posterior are suffi- 
ciently suppressed to allow improved mobility of the chain over 
the entire prior range. Typically it is possible to obtain accuracies 
of within 0.5 units in log-evidence via this method, but in cosmo- 
logical applications it typically requires of order 10® samples per 
chain (with around 10 chains required to determine a sampling er- 
ror). This makes evidence evaluation at least an order of magnitude 
more costly than parameter estimation. 

Another problem faced by thermodynamic integrat ion is in 
naviga ting through phase changes as pointed out by ISkill ing 
( l2004h . As A increases from to 1, one hopes that the thermody- 
namic integration tracks gradually up in L so inwards in X as illus- 
trated in Fig.[TJa). A is related to the slope of log L / log X curve as 
d\og L / dlog X = — 1/A. This requires the log-likelihood curve 
to be concave as in Fig. [TJa). If the log-likelihood curve is non- 
concave as in 

Fig.mb). then increasing A from to 1 will normally 
take the samples from A to the neighbourhood of B where the slope 
is —1/A — —1. In order to get the samples beyond B, A will need 
to be taken beyond 1 . Doing this will take the samples around the 
neighbourhood of the point of inflection C but here thermodynamic 
integration sees a phase change and has to jump across, somewhere 
near F, in which any practical computation exhibits hysteresis that 
destroys the calculation of Z. As will be discussed in the next sec- 
tion, nested sampling does not experience any problem with phase 
changes and moves steadily down in the prior volume X regardless 
of whether the log-likelihood is concave or convex or even differ- 
entiable at all. 



3 NESTED SAMPLING 

Nested sampling (Skilling 2004) is a Monte Carlo technique aimed 
at efficient evaluation of the Bayesian evidence, but also produces 
posterior inferences as a by-product. It exploits the relation between 
the likelihood and prior volume to transform the multidimensional 
evidence integral {2} into a one-dimensional integral. The 'prior 
volume' X is defined by dX = 7r(0)d^0, so that 

X(A) = / 7r(0)d°©, (4) 

where the integral extends over the region(s) of parameter space 
contained within the iso-likelihood contour L{&) — A. Assuming 
that L{X), i.e. the inverse of is a monotonically decreasing 
function of X (which is trivially satisfied for most posteriors), the 
evidence integral (|2} can then be written as 

2 = [ L{X)dX. (5) 
Jo 
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(a) (b) 

Figure 2. Cartoon illustrating (a) the posterior of a two dimensional prob- 
lem; and (b) the transformed L(X) function where the prior volumes Xi 
are associated with each likelihood Li. 



Thus, if one can evaluate the likelihoods Lj — L{Xj), where Xj 
is a sequence of decreasing values, 

< Xm < ■ ■ ■ < X2 < Xi < Xo ^ 1, (6) 

as shown schematically in Fig.|2l the evidence can be approximated 
numerically using standard quadrature methods as a weighted sum 

M 

Z = '^LiWi. (7) 

1=1 

In the following we will use the simple trapezium rule, for which 
the weights are given by Wi — — Xi+i). An example of 

a posterior in two dimensions and its associated function L{X) is 
shown in Fig.|2] 

3.1 Evidence evaluation 

The nested sampling algorithm performs the summation {TJ as fol- 
lows. To begin, the iteration counter is set to i = and TV 'live' 
(or 'active') samples are drawn from the full prior 7r(0) (which 
is often simply the uniform distribution over the prior range), so 
the initial prior volume is Xq = 1. The samples are then sorted 
in order of their likelihood and the smallest (with likelihood Lo) 
is removed from the live set and replaced by a point drawn from 
the prior subject to the constraint that the point has a likelihood 
L > Lo - The corresponding prior volume contained within this iso- 
likelihood contour will be a random variable given by Xi — tiXo, 
where ti follows the distribution Pr(t) = Nt^^^ (i.e. the prob- 
ability distribution for the largest of A*' samples drawn uniformly 
from the interval [0, 1]). At each subsequent iteration i, the discard- 
ing of the lowest likelihood point Li in the live set, the drawing of 
a replacement with L > Li and the reduction of the corresponding 
prior volume Xi — tiXi-i are repeated, until the entire prior vol- 
ume has been traversed. The algorithm thus travels through nested 
shells of likelihood as the prior volume is reduced. 

The mean and standard deviation of in t, which dominates the 
geometrical exploration, are: 

E[\nt] = ~l^, a[lnt] = l. (8) 

Since each value of In t is independent, after i iterations the prior 
volume will shrink down such that \nXi ~ — (i ± \/i)/N . Thus, 
one takes Xi — exp{—i/N). 

3.2 Stopping criterion 

The nested sampling algorithm should be terminated on determin- 
ing the evidence to some specified precision. One way would be to 
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proceed until the evidence estimated at each replacement changes 
by less than a specified tolerance. This could, however, underesti- 
mate the evidence in (for example) cases whe re the posterior con- 
tains any narrow peaks close to its maximum. ISkillind j2004h pro- 
vides an adequate and robust condition by determining an upper 
limit on the evidence that can be determined from the remaining 
set of current active points. By selecting the maximum-likelihood 
imax in the set of active points, one can safely assume that the 
largest evidence contribution that can be made by the remaining 
portion of the posterior is A^i = Lma^Xi, i.e. the product of 
the remaining prior volume and maximum likelihood value. We 
choose to stop when this quantity would no longer change the fi- 
nal evidence estimate by some user-defined value (we use 0.1 in 
log-evidence). 



3.3 Posterior inferences 

Once the evidence Z is found, posterior inferences can be eas- 
ily generated using the full sequence of discarded points from the 
nested sampling process, i.e. the points with the lowest likelihood 
value at each iteration i of the algorithm. Each such point is simply 
assigned the weight 



Pi 



LiWi 

z ■ 



(9) 



These samples can then be used to calculate inferences of posterior 
parameters such as means, standard deviations, covariances and so 
on, or to construct marginalised posterior distributions. 



the weight Wi = ^{Xi-i — Xi+i) decreases. At some point L 
flattens off sufficiently that the decrease in the weight dominates 
the increase in likelihood, so the increment LiWi reaches a max- 
imum and then starts to drop with iteration number. Most of the 
contribution to the final evidence value usually comes from the it- 
erations around the maximum point, which occurs in the region of 
X ~ , where H is the negative relative entropy. 



H 



In 



dP 

Ix 



dX 



r / r 



z 



(12) 



where P denotes the posterior. Since In Xi ~ (— i±\/i) /N, we ex- 
pect the procedure to take about NH ± V NH steps to shrink down 
to the bulk of the posterior. The dominant uncertainty in Z is due 
to the Poisson variability NH ± V NH in the number of steps to 
reach the posterior bulk. Correspondingly the accumulated values 
\nXi are subject to a standard deviation uncertainty of H/N. 
This uncertainty is transmitted to the evidence Z through Q, so 
that InZ also has standard deviation uncertainty of ^JhJN . Thus, 
putting the results together gives 



\uZ = In 



(13) 



Alongside the above uncertainty, there is also the error due to 
the discretisation of the integral in Q. Using the trapezoidal rule, 
this error will be C'(l/M^), and hence will be negligible given a 
sufficient number of iterations. 



3.4 Evidence error estimation 

If we could assign each Xi value exactly then the only error in 
our estimate of the evidence would be due to the discretisation of 
the integral l|7j. Since each ti is a random variable, however, the 
dominant source of uncertainty in the final Z value arises from the 
incorrect assignment of each prior volume. Fortunately, this uncer- 
tainty can be easily estimated. 

Shaw et al. made use of the knowledge of the distribution 
Vxiti) from which each ti is drawn to assess the errors in any 
quantities calculated. Given the probability of the vector t = 
(ti,t2, • . • ,iM) as 

M 

Pr(t) =nPr(iO, (10) 

one can write the expectation value of any quantity FiX) as 

= |F(t)Pr(t)d"t. (11) 

Evaluation of this integral is possible by Monte Carlo methods by 
sampling a given number of vectors t and finding the average F. By 
this method one can determine the variance of the curve in X ~ L 
space, and thus the uncertainty in the evidence integral J L{X)dX. 
As demonstrated by Shaw et al., this eliminates the need for any 
repetition of the algorithm to determine the standard error on the 
evidence value; this constitutes a significant increase in efficiency. 

In our new methods presented below, h owever, we use a dif- 
ferent error estimation scheme suggested by ISkillind ( |2004) : this 
also provides an error estimate in a single sampling but is far less 
computationally expensive and proceeds as follows. The usual be- 
haviour of the evidence increments LiWi is initially to rise with 
iteration number i, with the likelihood Li increasing faster than 



4 ELLIPSOIDAL NESTED SAMPLING 

The most challenging task in implementing the nested sampling 
algorithm is drawing samples from the prior within the hard con- 
straint L > Li at each iteration i. Employing a naive approach that 
draws blindly from the prior would result in a steady decrease in 
the acceptance rate of new samples with decreasing prior volume 
(and increasing likelihood). 

4.1 Single ellipsoid sampling 

Ellipsoidal sampling ( Mukherie e et al.l ilQQ& i) partially overcomes 
the above problem by approximating the iso-likelihood contour of 
the point to be replaced by an D-dimensional ellipsoid determined 
from the covariance matrix of the current set of live points. This 
ellipsoid is then enlarged by some factor / to account for the iso- 
likelihood contour not being exactly ellipsoidal. New points are 
then selected from the prior within this (enlarged) ellipsoidal bound 
until one is obtained that has a likelihood exceeding that of the dis- 
carded lowest-likelihood point. In the limit that the ellipsoid coin- 
cides with the true iso-likelihood contour, the acceptance rate tends 
to unity. An elegant method for dra wing uniform samples from an 
D-dimensional ellipsoid is given bv lShawetal1[2007i) . and is eas- 
ily extended to non-uniform priors. 

4.2 Recursive clustering 

Ellipsoidal nested sampling as described above is efficient for sim- 
ple unimodal posterior distributions, but is not well suited to mul- 
timodal distributions. The problem is illustrated in Fig. [3] in which 
one sees that the sampling efficiency from a single ellipsoid drops 
rapidly as the posterior value increases (particularly in higher di- 
mensions). As advocated by Shaw et al., and illustrated in the final 
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(a) (b) (c) (d) (e) 

Figure 3. Cailoon of ellipsoidal nested sampling from a simple bimodal distribution. In (a) we see that the ellipsoid represents a good bound to the active 
region. In (b)-(d), as we nest inward we can see that the acceptance rate will rapidly decrease as the bound steadily worsens. Figure (e) illustrates the increase 
in efficiency obtained by sampUng from each clustered region separately. 



panel of the figure, the efficiency can be substantially improved by 
identifying distinct clusters of five points that are well separated 
and constructing an individual ellipsoid for each cluster. The linear 
nature of the evidence means it is valid to consider each cluster in- 
dividually and sum the contributions provided one correctly assigns 
the prior volumes to each distinct region. Since the collection of A'^ 
active points is distributed evenly across the prior one can safely 
assume that the number of points within each clustered region is 
prop ortional to t h e prio r volume contained therein. 

IShaw et al] ( l2007h identify clusters recursively. Initially, at 
each iterati on i of the nested sampling algorithm, fc-means cluster- 
ing (see e.g.'M acKavl ( 12003 ^) with fc = 2 is applied to the live set of 
points to partition them into two clusters and an (enlarged) ellipsoid 
is constructed for each one. This division of the live set will only 
be accepted if two further conditions are met: (i) the total volume 
of the two ellipsoids is less than some fraction of the original pre- 
clustering ellipsoid and (ii) clusters are sufficiently separated by 
some distance to avoid overlapping regions. If these conditions are 
satisfied clustering will occur and the number of live points in each 
cluster are topped-up to A'^ by sampling from the prior inside the 
corresponding ellipsoid, subject to the hard constraint L > Li. The 
algorithm then searches independently within each cluster attempt- 
ing to divide it further. This process continues recursively until the 
stopping criterion is met. Shaw et al. also show how the error es- 
timation procedure can be modified to accommodate clustering by 
finding the probability distribution of the volume fraction in each 
cluster. 



5 IMPROVED ELLIPSOIDAL SAMPLING METHODS 

In this section, we present two new methods for ellipsoidal nested 
sampling that improve significantly in terms of sampling efficiency 
and robustness on the existing techniques outlined above, in par- 
ticular for multimodal distributions and those with pronounced de- 
generacies. 

5.1 General improvements 

We begin by noting several general improvements that are em- 
ployed by one or other of our new methods. 

5.1.1 Identification of clusters 

In both methods, we wish to identify isolated modes of the poste- 
rior distribution without prior knowledge of their number. The only 
information we have is the current live point set. Rather than using 
fc-means clustering with fc = 2 to partition the points into just two 



clusters at each iteration, we instead attempt to infer the appropri- 
ate number of clusters from the point set. After experimenting with 
several clustering algorithms to partition th e points into t he opti- 
mal nu mber of clusters, we found X-means ( Pelleg et al. 2000), G- 
means jHamerlv et alj2003l) and PG-means (Fen g et al. 2()06) to be 
the most promising. X-means partitions the points into the number 
of clusters that optimizes the Bayesian Information Criteria (BIC) 
measure. The G-means algorithm is based on a statistical test for 
the hypothesis that a subset of data follows a Gaussian distribution 
and runs fc-means with increasing fc in a hierarchical fashion until 
the test accepts the hypothesis that the data assigned to each fc- 
means centre are Gaussian. PG-means is an extension of G-means 
that is able to learn the number of clusters in the classical Gaus- 
sian mixture model without using fc-means. We found PG-means 
to outperform both X-means and G-means, especially in higher di- 
mensions and if there are cluster intersections, but the method re- 
quires Monte Carlo simulations at each iteration to calculate the 
critical values of the Kolmogorov-Smirnov test it uses to check for 
Gaussianity. As a result, PG-means is considerably more computa- 
tionally expensive than both X-means and G-means, and this com- 
putational cost quickly becomes prohibitive. Comparing X-means 
and G-means, we found the former to produce more consistent re- 
sults, particularly in higher dimensions. Since we have to cluster 
the live points at each iteration of the nested sampling process, we 
thus chose to use the X-means clustering algorithm. This method 
performs well overall, but does suffers from some occasional prob- 
lems that can result in the number of clusters identified being more 
or less than the actual number. We discuss these problems in the 
context of both our implementations in sections [?!2l and l531 but con- 
clude they do not adversely affect out methods. Ideally, we require 
a fast and robust clustering algorithm that always produces reliable 
results, particularly in high dimensions. If such a method became 
available, it could easily be substituted for X-means in either of our 
sampling techniques described below. 



5.1.2 Dynamic enlargement factor 

Once an ellipsoid has been constructed for each identified cluster 
such that it (just) encloses all the corresponding live points, it is 
enlarged by some factor /, as discussed in Sec. |4] It is worth re- 
membering that the corresponding increase in volume is (1 + /)^, 
where D is the dimension of the parameter space. The factor / does 
not, however, have to remain constant. Indeed, as the nested sam- 
pling algorithm moves into higher likelihood regions (with decreas- 
ing prior volume), the enlargement factor / by which an ellipsoid 
is expanded can be made progressively smaller. This holds since 
the ellipsoidal approximation to the iso-likelihood contour obtained 
from the N live points becomes increasingly accurate with decreas- 
ing prior volume. 
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Figure 4. If the ellipsoids corresponding to different modes are overlapping 
then sampling from one ellipsoid, enclosing all the points, can be quite in- 
efficient. Multiple overlapping ellipsoids present a better approximation to 
the iso-likelihood contour of a multimodal distribution. 



Also, when more than one ellipsoid is constructed at some 
iteration, the ellipsoids with fewer points require a higher enlarge- 
ment factor than those with a larger number of points. This is due 
to the error introduced in the evaluation of the eigenvalues from the 
covariance matrix calculated from a limited sample size. The stan- 
dard d eviation uncertainty in the eigenvalues is given bv lGirshi"c3 
( 1 19391) as follows: 

G{\,)^\,^/2j^, (14) 

where Aj denotes the jth eigenvalue and n is the number of points 
used in the calculation of the covariance matrix. 

The above considerations lead us to set the enlargement factor 
for the fcth ellipsoid at iteration i as fi,k = foX" \J N/rik where 
A'^ is the total number of live points, /o is the initial user-defined 
enlargement factor (defining the percentage by which each axis of 
an ellipsoid enclosing A'^ points, is enlarged), Xi is the prior volume 
at the ith iteration, Uk is the number of points in the fc*'' cluster, 
and a is a value between and 1 that defines the rate at which the 
enlargement factor decreases with decreasing prior volume. 



5.1.3 Detection of overlapping ellipsoids 

In some parts of our sampling methods, it is important to have a 
very fast method to determine whether two ellipsoids intersect, as 
this operation is performed many times at each iteration. Rather 
than applying the heuristic criteria used b y Shaw et al., we instead 
employ an exact algorithm proposed by Alfano et al. (2003) which 
involves the calculation of eigenvalues and eigenvectors of the co- 
variance matrix of the points in each ellipsoid. Since we have al- 
ready calculated these quantities in constructing the ellipsoids, we 
can rapidly determine if two ellipsoids intersect at very little extra 
computational cost. 



5.1.4 Sampling from overlapping ellipsoids 

As illustrated earlier in Fig. [3] for a multimodal distribution mul- 
tiple ellipsoids represent a much better approximation to the iso- 
likelihood contour than a single ellipsoid containing all the live 
points. At likelihood levels around which modes separate, X-means 
will often partition the point set into a number of distinct clus- 
ters, but the (enlarged) ellipsoids enclosing distinct identified clus- 
ters will tend to overlap (see Fig. |4} and the partitioning will be 
discarded. At some sufficiently higher likelihood level, the corre- 
sponding ellipsoids will usually no longer overlap, but it is wasteful 



to wait for this to occur. Hence, in both of our new sampling meth- 
ods described below it will prove extremely useful to be able to 
sample consistently from ellipsoids that may be overlapping, with- 
out biassing the resultant evidence value or posterior inferences. 

Suppose at iteration i of the nested sampling algorithm, a set 
of live points is partitioned into K clusters by X-means, with the 
fc*'' cluster having rifc points. Using the covariance matrices of each 
set of points, each cluster then is enclosed in an ellipsoid which is 
then expanded using an enlargement factor The volume Vk of 
each resulting ellipsoid is then found and one ellipsoid is chosen 
with probability pk equal to its volume fraction: 

Pk = Vk/Vtot, (15) 

where Kot ~ X^fcLi '^k. Samples are then drawn from the cho- 
sen ellipsoid until a sample is found for which the hard constraint 
L > Li is satisfied, where Li is the lowest-likelihood value among 
all the live points under consideration. There is, of course, a pos- 
sibility that the chosen ellipsoid overlaps with one or more other 
ellipsoids. In order to take an account of this possibility, we find 
the number of ellipsoids, rie, in which the sample lies and only ac- 
cept the sample with probability l/rie . This provides a consistent 
sampling procedure in all cases. 

5.2 Method 1: simultaneous ellipsoidal sampling 

This method is built in large part around the above technique for 
sampling consistently from potentially overlapping ellipsoids. At 
each iteration i of the nested sampling algorithm, the method pro- 
ceeds as follows. The full set of A'^ live points is partitioned using 
X-means, which returns K clusters with ni, 712, . . . , uk points re- 
spectively. For each cluster, the covariance matrix of the points is 
calculated and used to construct an ellipsoid that just encloses all 
the points; each ellipsoid is then expanded by the enlargement fac- 
tor fi^k (which can depend on iteration number i as well as the 
number of points in the fcth ellipsoid, as outlined above). This re- 
sults in a set of K ellipsoids ei , 62 , . . . , at each iteration, which 
we refer to as sibling ellipsoids. The lowest-likelihood point (with 
likelihood Li) from the full set of A^ live points is then discarded 
and replaced by a new point drawn from the set of sibling ellip- 
soids, correctly taking into account any overlaps. 

It is worth noting that at early iterations of the nested sampling 
process, X-means usually identifies only K = 1 cluster and the 
corresponding (enlarged) ellipsoid completely encloses the prior 
range, in which case sampling is performed from the prior range 
instead. Beyond this minor inconvenience, it is important to recog- 
nise that any drawbacks of the X-means clustering method have 
little impact on the accuracy of the calculated evidence or posterior 
inferences. We use X-means only to limit the remaining prior space 
from which to sample, in order to increase efficiency. If X-means 
returns greater or fewer than the desired number of clusters, one 
would still sample uniformly from the remaining prior space since 
the union of the corresponding (enlarged) ellipsoids would still en- 
close all the remaining prior volume. Hence, the evidence calcu- 
lated and posterior inferences would remain accurate to within the 
uncertainties discussed in Sec. 13.41 

5.3 Method 2: clustered ellipsoidal sampling 

This method is closer in spirit to the recursive clustering technique 
advocated by Shaw et al. At the ith iteration of the nested sam- 
pling algorithm, the method proceeds as follows. The full set of N 
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live points is again partitioned using X-means to obtain K clus- 
ters with Til, n2, riK points respectively, and each cluster is en- 
closed in an expanded ellipsoid as outlined above. In this second 
approach, however, each ellipsoid is then tested to determine if it in- 
tersects with any of its sibling ellipsoids or any other non-ancestor 
ellipsoicfl The nested sampling algorithm is then continued sepa- 
rately for each cluster contained within a non-intersecting ellipsoid 
efc, after in each case (i) topping up the number of points to N 
by sampling N points within that satisfy L > Li\ and 

(ii) setting the corresponding remaining prior volume to X^*^' = 
Xi^i{nk/N). Finally, the remaining set of Nr points contained 
within the union of the intersecting ellipsoids at iteration i is topped 
up to TV using the method for sampling from such a set of ellipsoids 
outlined in Sec. 15.1.41 and the associated remaining prior volume is 
settoXi = Xi-i{Nr/N). 

As expected, in the early stages, X-means again usually iden- 
tifies only K = 1 cluster and this is dealt with as in Method 1. 
Once again, the drawbacks of X-means do not have much impact 
on the accuracy of the global evidence determination. If X-means 
finds fewer clusters than the true number of modes, then some clus- 
ters correspond to more than one mode and will have an enclosing 
ellipsoid larger than it would if X-means had done a perfect job; 
this increases the chances of the ellipsoid intersecting with some 
of its sibling or non-ancestor ellipsoids. If this ellipsoid is non- 
intersecting, then it can still split later and hence we do not lose 
accuracy. On the other hand, if X-means finds more clusters than 
the true number of modes, it is again likely that the corresponding 
enclosing ellipsoids will overlap. It is only in the rare case where 
some of such ellipsoids are non-intersecting, that the possibility 
exists for missing part of the true prior volume. Our use of an en- 
largement factor strongly mitigates against this occurring. Indeed, 
we have not observed such behaviour in any of our numerical tests. 

5.4 Evaluating 'local' evidences 

For a multimodal posterior, it can prove useful to estimate not only 
the total (global) evidence, but also the 'local' evidences associated 
with each mode of the distribution. There is inevitably some arbi- 
trariness in defining these quantities, since modes of the posterior 
necessarily sit on top of some general 'background' in the prob- 
ability distribution. Moreover, modes lying close to one another 
in the parameter space may only 'separate out' at relatively high 
likelihood levels. Nonetheless, for well-defined, isolated modes, a 
reasonable estimate of the posterior volume that each contains (and 
hence the local evidence) can be defined and estimated. Once the 
nested sampling algorithm has progressed to a likelihood level such 
that (at least locally) the 'footprint' of the mode is well-defined, one 
needs to identify at each subsequent iteration those points in the 
live set belonging to that mode. The practical means of performing 
this identification and evaluating the local evidence for each mode 
differs between our two sampling methods. 

5.4.1 Method 1 

The key feature of this method is that at each iteration the full live 
set of A'^ points is evolved by replacing the lowest likelihood point 
with one drawn (consistently) from the complete set of (potentially 
overlapping) ellipsoids. Thus, once a likelihood level is reached 

^ A non-ancestor ellipsoid of is any ellipsoid that was non-intersecting 
at an earlier iteration and does not completely enclose e j. . 



such that the footprint of some mode is well defined, to evaluate 
its local evidence one requires that at each subsequent iteration the 
points associated with the mode are consistently identified as a sin- 
gle cluster. If such an identification were possible, at the ith itera- 
tion one would simply proceeds as follows: (i) identify the cluster 
(contained within the ellipsoid e;) to which the point with the low- 
est likelihood Li value belongs; (ii) update the local prior volume of 
each of the clusters as — {nk/N)Xi, where Uk is the number 
of points belonging to the klh cluster and Xi is the total remaining 
prior volume; (iii) increment the local evidence of the cluster con- 
tained within e; by ^Li{Xj^'2i — X^^i). Unfortunately, we have 
found that X-means is not capable of consistently identifying the 
points associated with some mode as a single cluster. Rather, the 
partitioning of the live point set into clusters can vary appreciably 
from one iteration to the next. PG-means produced reasonably con- 
sistent results, but as mentioned above is far too computationally 
intensive. We are currently exploring ways to reduce the most com- 
putationally expensive step in PG-means of calculating the critical 
values for Kolmogorov-Smirnov test, but this is not yet completed. 
Thus, in the absence of a fast and consistent clustering algorithm, 
it is currently not possible to calculate the local evidence of each 
mode with our simultaneous ellipsoidal sampling algorithm. 



5.4.2 Method 2 

The key feature of this method is that once a cluster of points has 
been identified such that its (enlarged) enclosing ellipsoid does not 
intersect with any of its sibling ellipsoids (or any other non-ancestor 
ellipsoid), that set of points is evolved independently of the rest (af- 
ter topping up the number of points in the cluster to TV). This ap- 
proach therefore has some natural advantages in evaluating local 
evidences. There remain, however, some problems associated with 
modes that are sufficiently close to one another in the parameter 
space that they are only identified as separate clusters (with non- 
intersecting enclosing ellipsoids) once the algorithm has proceeded 
to likelihood values somewhat larger than the value at which the 
modes actually separate. In such cases, the local evidence of each 
mode will be underestimated. The simplest solution to this prob- 
lem would be to increment the local evidence of each cluster even 
if its corresponding ellipsoid intersects with other ellipsoids, but as 
mentioned above X-means cannot produce the consistent clustering 
required. In this case we have the advantage of knowing the itera- 
tion beyond which a non-intersecting ellipsoid is regarded as a sep- 
arate mode (or a collection of modes) and hence we can circumvent 
this problem by storing information (eigenvalues, eigenvectors, en- 
largement factors etc.) of all the clusters identified, as well as the 
rejected points and their likelihood values, from the last few itera- 
tions. We then attempt to match the clusters in the current iteration 
to those identified in the last few iterations, allowing for the inser- 
tion or rejection of points from clusters during the intervening iter- 
ations On finding a match for some cluster in a previous iteration i' , 
we check to see which (if any) of the points discarded between the 
iteration i' and the current iteration i were members of the cluster. 
For each iteration j (between i' and i inclusive) where this occurs, 
the local evidence of the cluster is incremented by LjXj, where Lj 
and Xj are the lowest likelihood value and the remaining prior vol- 
ume corresponding to iteration j. This series of operations can be 
performed quite efficiently; even storing information as far as back 
as 15 iterations does not increase the running time of the algorithm 
appreciably. Finally, we note that if closely lying modes have very 
different amplitudes, the mode(s) with low amplitude may never 
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Figure 5. Cartoon of the sub-clustering approach used to deal with degen- 
eracies. The true iso-likelihood contour contains the shaded region. The 
large enclosing ellipse is typical of that constructed using our basic method, 
whereas sub-clustering produces the set of small ellipses. 



be identified as being separate and will eventually be lost as the 
algorithm moves to higher likelihood values. 



5.5 Dealing with degeneracies 

As will be demonstrated in Sec.l?] the above methods are very ef- 
ficient and robust at sampling from multimodal distributions where 
each mode is well-described at most likelihood levels by a multi- 
variate Gaussian. Such posteriors might be described colloquially 
as resembling a 'bunch of grapes' (albeit in many dimensions). 
In some problems, however, some modes of the posterior might 
possess a pronounced curving degeneracy so that it more closely 
resembles a (multidimensional) 'banana'. Such features are prob- 
lematic for all sampling methods, including our proposed ellip- 
soidal sampling techniques. Fortunately, we have found that a sim- 
ple modification to our methods allows for efficient sampling even 
in the presence of pronounced degeneracies. 

The essence of the modification is illustrated in Fig. |5] Con- 
sider an isolated mode with an iso-likelihood contour displaying a 
pronounced curved degeneracy. X-means will usually identify all 
the live points contained within it as belonging to a single clus- 
ter and hence the corresponding (enlarged) ellipsoid will represent 
a very poor approximation. If, however, one divides each cluster 
identified by X-means into a set of sub-clusters, one can more ac- 
curately approximate the iso-likelihood contour with many small 
overlapping ellipsoids and sample from them using the method out- 
lined in Sec. [5T4l 

To sample with maximum efficiency from a pronounced de- 
generacy (particularly in higher dimensions), one would like to di- 
vide every cluster found by X-means into as many sub-clusters as 
possible to allow maximum flexibility in following the degeneracy. 
In order to be able to calculate covariance matrices, however, each 
sub-cluster must contain at least (D + 1) points, where D is the 
dimensionality of the parameter space. This in turn sets an upper 
limit on the number of sub-clusters. 

Sub-clustering is performed through an incremental fc-means 
algorithm with k — 2. The process starts with all the points as- 
signed to the original cluster. At iteration i of the algorithm, a point 
is picked at random from the sub-cluster Cj that contains the most 
points. This point is then set as the centroid, rrii+i, of a new cluster 
Ci+i . All those points in any of the other sub-clusters that are closer 
to rrii+i than the centroid of their own sub-cluster, and whose sub- 
cluster has more than (D + 1) points are then assigned to c^+i and 
rrii+i is updated. All the points not belonging to c^+i are again 
checked with the updated mi+i until no new point is assigned to 



Ci+i. At the end of the iteration i, if c^+i has less than (D + 1) 
points then the points in Cj that are closest to rrii+i are assigned to 
Ci+i until Ci+i has {D + 1) points. In the case that Cj has fewer 
than 2{D + 1) points, then points are assigned from Ci+i to Cj . The 
algorithm stops when, at the start of an iteration, the sub-cluster 
with most points has fewer than 2(D + 1) members, since that 
would result in a new sub-cluster with fewer than 2{D + 1) points. 
This process can result in quite a few sub-clusters with more than 
2{D + 1) but less than 2{D + 1) points and hence there is a pos- 
sibility for even more sub-clusters to be formed. This is achieved 
by finding the sub-cluster c; closest to the cluster, Ck . If the sum of 
points in c; and Ck is greater than or equal to 3(Z)+1), an additional 
sub-cluster is created out of them. 

Finally, we further reduce the possibility that the union of the 
ellipsoids corresponding to different sub-clusters might not enclose 
the entire remaining prior volume as follows. For each sub-cluster 
Ck, we find the one point in each of the n nearest sub-clusters that is 
closest to the centroid of . Each such point is then assigned to Ck 
and its original sub-cluster, i.e. it is 'shared' between the two sub- 
clusters. In this way all the sub-clusters and their corresponding 
ellipsoids are expanded, jointly enclosing the whole of the remain- 
ing prior volume. In our numerical simulations, we found setting 
n = 5 performs well. 



6 METROPOLIS NESTED SAMPLING 

An alternative method for drawing samples from the prior within 
the hard constraint L > Li where Li is the lowest likelih ood value 
at itera tion i, is the stan dard Metropolis algorithm (see e.g. lMacKavl 
( l2003l) ) as suggested in lSivia et al.l ( I2006D . In this approach, at each 
iteration, one of the live points, 0, is picked at random and a new 
trial point, ©', is generated using a symmetric proposal distribution 
(5(0', 0). The trial point ©' is then accepted with probability 

il if 7r(0') > 7r(0) andL(0') > Li 

7r(0')/7r(0) if 7r(0') s; 7r(0) andL(0') > L, (16) 
otherwise 

A symmetric Gaussian distribution is often used as the proposal dis- 
tribution. The dispersion a of this Gaussian should be sufficiently 
large compared to the size of the region satisfying L > Li that the 
chain is reasonably mobile, but without being so large that the like- 
lihood constraint stops nearly all proposed moves. Since an inde- 
pendent sample is required, ristop steps are taken by the Metropolis 
algorithm so that the chain diffuses far away from the starting po- 
sition and the memory of it is lost. In principle, one could calcu- 
late convergence statistics to determin e at which point th e chain is 
sampling from the target distribution. ISivia et al] | |2006|) propose, 
however, that one should instead simply take natep ~ 20 steps 
in all cases. The appropriate value of a tends to diminish as the 
nested algorithm moves towards higher likelihood regions and de- 
creasing prior mass. Hence, the value of a is updated at the end of 
each nested sampling iteration, so that the acceptance rate is around 
50%, as follows: 



if iVa > iVr 



(17) 



where A'a and A'r are the numbers of accepted and rejected samples 
in the latest Metropolis sampling phase. 

In principle, this approach can be used quite generally and 
does not require any clustering of the live points or construction 
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of ellipsoidal bounds. In order to facilitate the evaluation of 'lo- 
cal' evidences, however, we combine this approach with the clus- 
tering process performed in Method 2 above to produce a hybrid 
algorithm, which we describe below. Moreover, as we show in Sec- 
tion l7.1l this hybrid approach is significantly more efficient in sam- 
pling from multimodal posteriors than using just the Metropolis 
algorithm without clustering. 

At each iteration of the nested sampling process, the set of 
live points is partitioned into clusters, (enlarged) enclosing ellip- 
soids are constructed, and overlap detection is performed precisely 
in the clustered ellipsoidal method. Once again, the nested sam- 
pling algorithm is then continued separately for each cluster con- 
tained within a non-intersecting ellipsoid efc. This proceeds by (i) 
topping up the number of points in each cluster to TV by sampling 
N — rik points that satisfy L > Li using the Metropolis method 
described above, and (ii) setting the corresponding remaining prior 
mass to Xj^''' — (n^/A''). Prior to topping up a cluster in step 
(i), a 'mini' bum-in is performed during which the width ak of the 
proposal distribution is adjusted as described above; the width at 
is then kept constant during the topping-up step. 

During the sampling the starting point for the random walk 
is chosen by picking one of the ellipsoids with probability equal 
to its volume fraction: 

Pk = 14/Kot, (18) 

where Vk is the volume occupied by the ellipsoid Ck and Vtot = 
T4, and then picking randomly from the points lying inside 
the chosen ellipsoid. This is done so that the number of points in- 
side the modes is proportional to the prior volume occupied by 
those modes. We also supplement the condition J16| ( for a trial 
point to be accepted by the requirement that it must not lie inside 
any of the non-ancestor ellipsoids in order to avoid over-sampling 
any region of the prior space. Moreover, in step (i) if any sample 
accepted during the topping-up step lies outside its corresponding 
(expanded) ellipsoid, then that ellipsoid is dropped from the list of 
those to be explored as an isolated likelihood region in the current 
iteration since that would mean that the region has not truly sepa- 
rated from the rest of the prior space. 

Metropolis nested sampling can be quite efficient in higher- 
dimensional problems as compared with the ellipsoidal sampling 
methods since, in such cases, even a small region of an ellipsoid 
lying outide the true iso-likelihood contour would occupy a large 
volume and hence result in a large drop in efficiency. Metropo- 
lis nested sampling method does not suffer from this curse of di- 
mensionality as it only uses the ellipsoids to separate the isolated 
likelihood regions and consequently the efficiency remains approx- 
imately constant at ~ 1 /ustap , which is 5 per cent in our case. This 
will be illustrated in the next section in which Metropolis nested 
sampling is denoted as Method 3. 



7 APPLICATIONS 

In this section we apply the three new algorithms discussed in the 
previous sections to two toy problems to demonstrate that they in- 
deed calculate the Bayesian evidence and make posterior inferences 
accurately and efficiently. 

7.1 Toy model 1 

For our fir s t exam ple, we consider the problem investigated by 
IShaw et al.1 j2007h as their Toy Model II, which has a posterior of 




Figure 6. Toy Model la: a two-dimensional posterior consisting of the sum 
of 5 Gaussian peaks of varying width and height placed randomly in the 
unit circle in the xjz-plane. The dots denote the set of live points at each 
successive likelihood level in the nested sampling algorithm using Method 
1 (simultaneous elhpsoidal sampling). 



Cluster 1 
Cluster 2 




Figure 7. As in Fig.|6] but using Method 2 (clustered ellipsoidal sampling). 
The different colours denote points assigned to isolated clusters as the algo- 
rithm progresses. 



known functional form so that an analytical evidence is available 
to compare with those found by our nested sampling algorithms. 
The two-dimensional posterior consists of the sum of 5 Gaussian 
peaks of varying width, ak, and amplitude, Ak, placed randomly 
within the unit circle in the xy-plane. The parameter values defin- 
ing the Gaussians are listed in Table[T] leading to an analytical total 
log-evidence \nZ = —5.271. The analytical 'local' log-evidence 
associated with each of the 5 Gaussian peaks is also shown in the 
table. 



Peak 


X 


Y 


A 


(7 


Local in Z 


1 


-0.400 


-0.400 


0.500 


0.010 


-9.210 


2 


-0.350 


0.200 


1.000 


0.010 


-8.517 


3 


-0.200 


0.150 


0.800 


0.030 


-6.543 


4 


0.100 


-0.150 


0.500 


0.020 


-7.824 


5 


0.450 


0.100 


0.600 


0.050 


-5.809 



Table 1. The parameters X^, Yj., Aj., crj. defining the 5 Gaussians in Fig.|6] 
The log-volume (or local log-evidence) of each Gaussian is also shown. 
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Toy model la Method 1 Method 2 Method 3 Shaw et al. 

InZ -5.247 -5.178 -5.358 -5.296 

EiTor 0.110 0.112 0.115 0.084 

A^iikc 39,911 12,569 161,202 101,699 



Table 2. The calculated global log-evidence, its uncertainty and the number 
of likelihood evaluations required in analysing Toy model la using Method 
1 (simultaneous nested sampling), Method 2 (cluster ed ellipsoidal sam - 
pling) and the recursive clustering method described bv lShaw et alj JioOTi) . 
The values correspond to a single run of each algorithm. The analytical 
global log-evidence is —5.271. 



Peak X Y Localln.Z 



1 


-0.400 ± 0.002 


-0.400 ± 0.002 


-9.544 ±0.162 


2 


-0.350 ±0.002 


0.200 ± 0.002 


-8.524 ±0.161 


3 


-0.209 ± 0.052 


0.154 ±0.041 


-6.597 ±0.137 


4 


0.100 ±0.004 


-0.150 ±0.004 


-7.645 ±0.141 


5 


0.449 ±0.011 


0.100 ±0.011 


-5.689 ±0.117 



Table 3. The infeired mean values of X and Y and the local evidence for 
each Gaussian peak in Toy model 1 a using Method 2 (clustered ellipsoidal 
sampling). 



The results of applying Method 1 (simultaneous ellipsoidal 
sampling). Method 2 (clustered ellipsoidal sampling) to this prob- 
lem are illustrated in Figs|6]and[7]respectively; a very similar plot 
to Fig. [7] is obtained for Method 3 (Metropolis nested sampling). 
For all three methods, we used A*' = 300 live points, switched 
off the sub-clustering modification (for methods 1 and 2) outlined 
in Sec. 15.51 and assumed a flat prior within the unit circle for the 
parameters X and Y in this two-dimensional problem. In each fig- 
ure, the dots denote the set of live points at each successive likeli- 
hood level in the nested sampling algorithm. For methods 2 and 3, 
the different colours denote points assigned to isolated clusters as 
the algorithm progresses. We see that all three algorithms sample 
effectively from all the peaks, even correctly isolating the narrow 
Gaussian peak (cluster 2) superposed on the broad Gaussian mode 
(cluster 3). 

The global log-evidence values, their uncertainties and the 
number of likelihood evaluations required for each method are 
shown in Table |2] Methods 1, 2 and 3, all produce evidence val- 
ues that are accurate to within the estimated uncertainties. Also, 
listed in the table are the corresponding quantities obtained by 
IShaw et al. I ( l2007h . which are clearly consistent. Of particular in- 
terest, is the number of likelihood evaluations required to produce 
these evidence estimates. Methods 1 and 2 made around 40,000 and 
10,000 likelihood evaluations respectively, whereas the Shaw et al. 
method required more than 3 times this number (in all cases just 
one run of the algorithm was performed, since multiple runs are 
not required to estimate the uncertainty in the evidence). Method 
3 required about 170,000 likelihood evaluations since its efficiency 
remains constant at around 5%. It should be remembered that Shaw 
et al. showed that using thermodynamic integration, and perform- 
ing 10 separate runs to estimate the error in the evidence, required 
3.6 X 10^ likelihood evaluations to reach the same accuracy. As 
an aside, we also investigated a 'vanilla' version of the Metropolis 
nested sampling approach, in which no clustering was performed. 
In this case, over 570,000 likelihood evaluations were required to 
estimate the evidence to the same accuracy. This drop in efficieny 
relative to Method 3 resulted from having to sample inside differ- 
ent modes using a proposal distribution with the same width a in 
every case. This leads to a high rejection rate inside narrow modes 
and random walk behaviour in the wider modes. In higher dimen- 
sions this effect will be exacerbated. Consequently, the clustering 
process seems crucial for sampling efficiently from multimodal dis- 
tributions of different sizes using Metropolis nested sampling. 

Using methods 2 (clustered ellipsoidal sampling) and 3 
(Metropolis sampling) it is possible to calculate the 'local' evidence 
and make posterior inferences for each peak separately. For Method 
2, the mean values inferred for the parameters X and Y and the lo- 
cal evidences thus obtained are listed in Table[3l and clearly com- 



Toy model lb 


Real Value 


Method 2 


Method 3 


In 2 


4.66 


4.47±0.20 


4.52±0.20 


local In Z\ 


4.61 


4.38±0.20 


4.40±0.21 


local In Z2 


1.78 


1.99±0.21 


2.15±0.21 


local In 


0.00 


0.09±0.20 


0.09±0.20 


Mike 




130,529 


699,778 



Table 4. The true and estimated global log-evidence, local log-evidence 
and number of likelihood evaluations required in analysing Toy model lb 
using Method 2 (clustered ellipsoidal sampling) and Method 3 (Metropolis 
sampling). 



pare well with the true values given in Table[T] Similar results were 
obtained using Method 3. 

In real applications the parameter space is usually of higher 
dimension and different modes of the posterior may vary in ampli- 
tude by more than an order of magnitude. To investigate this situ- 
ation, we also considered a modified problem in which three 10- 
dimensional Gaussians are placed randomly in the unit hypercube 
[0, 1] and have amplitudes differing by two orders of magnitude. 
We also make one of the Gaussians elongated. The analytical local 
log-evidence values and those found by applying Method 2 (with- 
out sub-clustering) and Method 3 are shown in Table |4] We used 
A'^ = 600 live points with both of our methods. 

We see that both methods detected all 3 Gaussians and cal- 
culated their evidence values with reasonable accuracy within the 
estimated uncertainties. Method 2 required ~ 4 times fewer like- 
lihood calculations than Method 3, since in this problem the ellip- 
soidal methods can still achieve very high efficiency (28 per cent), 
while the efficiency of the Metropolis method remains constant (5 
per cent) as discussed in Sec. [6] 



7.2 Toy model 2 

We now illustrate the capabilities of our methods in sampling from 
a posterior containing multiple modes with pronounced (curving) 
degeneracies i n high dimensions. Ou r toy problem is based on that 
investigated bv lAllanachetal.l ( l2007l) , but we extend it to more than 
two dimensions. 

The likelihood function is defined as. 



L{0) = circ(0; ci, ri, toi) + circ(0; C2, r2, W2) 

{\e-c\-rf 



where 

circ(0; c, r, w 



: exp 



(19) 



(20) 



In 2-dimensions, this toy distribution represents two well separated 
rings, centred on the points ci and C2 respectively, each of radius 
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Figure 8. Toy model 2: a two-dimensional example of the likelihood func- 
tion defined in (19) and <20l . 



r and with a Gaussian radial profile of width w (see Fig.[8). With 
a sufficiently small w value, this distribution is representative of 
the likelihood functions one might encounter in analysing forth- 
coming particle physics experiments in the context of beyond-the- 
Standard-Model paradigms; in such models the bulk of the proba- 
bility lies within thin sheets or hypersurfaces through the full pa- 
rameter space. 

We investigate the above distribution up to a 100-dimensional 
parameter space 9. In all cases, the centers of the two rings are sep- 
arated by 7 units in the parameter space, and we take wi = W2 = 
0.1 and ri = r2 = 2. We make ri and r2 equal, since in higher 
dimensions any slight difference between these two would result in 
a vast difference between the volumes occupied by the rings and 
consequently the ring with the smaller r value would occupy van- 
ishingly small probability volume making its detection almost im- 
possible. It should also be noted that setting w = Q.l means the 
rings have an extremely narrow Gaussian profile and hence they 
represent an 'optimally difficult' problem for our ellipsoidal nested 
sampling algorithms, even with sub-clustering, since many tiny el- 
lipsoids are required to obtain a sufficiently accurate representa- 
tion of the iso-likelihood surfaces. For the two-dimensional case, 
with the parameters described above, the likelihood function is that 
shown in Fig. [8] 

Sampling from such a highly non-Gaussian and curved dis- 
tribution can be very difficult and inefficient, especially in higher 
dimensions. In such problems a re-parameterization is usually per- 
formed to transf orm the distribution i nto one that is geometr ically 
simpler (see e.g. lDunklev et al.l ( l200^ and lVerde etaLl ( l2003h ), but 
such approaches are generally only feasible in low-dimensional 
problems. In general, in D dimensions, the transformations usu- 
ally employed introduce D — 1 additional curvature parameters and 
hence become rather inconvenient. Here, we choose not to attempt 
a re-parameterization, but instead sample directly from the distri- 
bution. 

Applying the ellipsoidal nested sampling approaches (meth- 
ods 1 and 2) to this problem without using the sub-clustering modi- 
fication would result in highly inefficient sampling as the enclosing 
ellipsoid would represent an extremely poor approximation to the 
ring. Thus, for this problem, we use Method 2 with sub-clustering 
and Method 3 (Metropolis nested sampling). We use 400 live points 
in both algorithms. The sampling statistics are listed in Table |5] 
and Table |6] respectively. The 2-dimensional sampling results us- 
ing Method 2 (with sub-clustering) are also illustrated in Fig.|9] in 



Common Points 
Peak 1 
Peak 2 




Figure 9. Toy Model 2: a two-dimensional posterior consisting of two rings 
with naiTow Gaussian profiles as defined in equation l20l The dots denote the 
set of live points at each successive likelihood level in the nested sampling 
algorithm using Method 2 (with sub-clustering). 





Method 2 (with sub-clustering) 


Method 3 


D 


Mike 


Efficiency 


Mike 


Efficiency 


2 


27, 658 


15.98% 


76, 993 


6.07% 


5 


69,094 


9.57% 


106,015 


6.17% 


10 


579, 208 


1.82% 


178, 882 


5.75% 


20 


43,093,230 


0.05% 


391, 113 


5.31% 


30 






572, 542 


5.13% 


50 






1,141,891 


4.95% 


70 






1, 763, 253 


4.63% 


100 






3, 007, 889 


4.45% 



Table 6. The number of likelihood evaluations and sampling efficiency for 
Method 2 (with sub-clustering) and Method 3 when applied to toy model 3, 
as a function of the dimension D of the parameter space 



which the set of live points at each successive likelihood level is 
plotted; similar results are obtained using Method 3. 

We see that both methods produce reliable estimates of the 
global and local evidences as the dimension D of the parame- 
ter space increases. As seen in Table (6] however, the efficieny of 
Method 2, even with sub-clustering, drops significantly with in- 
creasing dimensionality. As a result, we do not explore the prob- 
lem with method 2 for dimensions greater than D = 20. This drop 
in efficiency is caused by (a) in higher dimensions even a small 
region of an ellipsoid that lies outside the true iso-likelihood con- 
tour occupies a large volume and hence results in a drop in sam- 
pling efficiency; and (b) in D dimensions, the minimum number 
of points in an ellipsoid can be [D + 1), as discussed in Sec. 15.51 
and consequently with a given number of live points, the number of 
sub-clusters decreases with increasing dimensionality, resulting in 
a poor approximation to the highly curved iso-likelihood contour. 
Nonetheless, Method 3 is capable of obtaining evidence estimates 
with reasonable efficiency up to D = 100, and should continue to 
operate effectively at even higher dimensionality. 



8 BAYESIAN OBJECT DETECTION 

We now consider how our multimodal nested sampling approaches 
may be used to address the difficult problem of detecting and char- 
acterizing discrete objects hidden in some background noise. A 
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Analytical Method 2 (with sub-clustering) Method 3 

D \nZ local liiZ \nZ local In Z\ local in Z2 In Z local In Z\ local In Z2 



2 


-1.75 


-2.44 


-1.71 ± 0.08 


-2.41 ±0.09 


-2.40 ±0.09 


-1.63 ± 0.08 


-2.35 ± 0.09 


-2.31 ±0.09 


5 


-5.67 


-6.36 


-5.78 ± 0.13 


-6.49 ±0.14 


-6.46 ± 0.14 


-5.69 ± 0.13 


-6.35 ± 0.13 


-6.41 ±0.14 


10 


-14.59 


-15.28 


-14.50 ±0.20 


-15.26 ±0.20 


-15.13 ± 0.20 


-14.31 ± 0.19 


-15.01 ± 0.20 


14.96 ±0.20 


20 


-36.09 


-36.78 


-35.57 ±0.30 


-36.23 ±0.30 


-36.20 ±0.30 


-36.22 ± 0.30 


-36.77 ±0.31 


-37.09 ±0.31 


30 


-60.13 


-60.82 








-60.49 ± 0.39 


-61.69 ± 0.39 


-60.85 ±0.39 


50 


-112.42 


-113.11 








-112.27 ±0.53 


-112.61 ± 0.53 


-113.53 ±0.53 


70 


-168.16 


-168.86 








-167.71 ± 0.64 


-167.98 ± 0.64 


-169.32 ±0.65 


100 


-255.62 


-256.32 








-253.72 ± 0.78 


-254.16 ± 0.78 


-254.77 ±0.78 



Table 5. The true and estimated global and local log Z for toy model 3, as a function of the dimensions D of the parameter space, using Method 2 (with 
sub-clustering) and Method 3. 



Bayesian approach to this problem in an astrophysical context was 
first presented by Hobson & McLachlan (2003; hereinafter HM03), 
and our general framework follows this closely. For brevity, we will 
consider our data vector D to denote the pixel values in a single 
image in which we wish to search for discrete objects, although D 
could equally well represent the Fourier coefficients of the image, 
or coefficients in some other basis. 



8.1 Discrete objects in background 

Let us suppose we are interested in detecting and characterising 
some set of (two-dimensional) discrete objects, each of which is 
described by a template t{x; a), which is parametrised in terms of 
a set of parameters a that might typically denote (collectively) the 
position {X, Y) of the object, its amplitude A and some measure 
R of its spatial extent. In particular, in this example we will assume 
circularly-symmetric Gaussian-shaped objects defined by 



t{x; a) — AexTp 



2R? 



(21) 



so that a — {X, Y, A, R}. If A^obj such objects are present and the 
contribution of each object to the data is additive, we may write 



Object 


X 


Y 


A 


R 


1 


43.71 


22.91 


10.54 


3.34 


2 


101.62 


40.60 


1.37 


3.40 


3 


92.63 


110.56 


1.81 


3.66 


4 


183.60 


85.90 


1.23 


5.06 


5 


34.12 


162.54 


1.95 


6.02 


6 


153.87 


169.18 


1.06 


6.61 


7 


155.54 


32.14 


1.46 


4.05 


8 


130.56 


183.48 


1.63 


4.11 



Table 7. The parameters Xk, Yj; , Afe 
Gaussian shaped objects in Fig. 1 101 



and Rk {k = 1, 8) defining the 



2 units. This corresponds to a signal-to-noise ratio 0.5-1 as com- 
pared to the peak amplitude of each object (ignoring the first ob- 
ject). It can be seen from the figure that with this level of noise, 
apart from the first object, only a few objects are (barely) visible 
with the naked eye and there are certain areas where the noise con- 
spires to give the impression of an object where none is present. 
This toy problem thus presents a considerable challenge for any 
object detection algorithm. 



D 



(22) 



where s{ak) denotes the contribution to the data from the fcth dis- 
crete object and n denotes the generalised 'noise' contribution to 
the data from other 'background' emission and instrumental noise. 
Clearly, we wish to use the data D to place constraints on the val- 
ues of the unknown parameters A'obj and ak {k — 1, . . . , Nohj)- 



8.2 Simulated data 

Our underlying model and simulated data are shown in Fig. [TO] 
and are similar to the example considered by HMOS. The left panel 
shows the 200 x 200 pixel test image, wliich contains 8 Gaussian 
objects described by eq. i2H with the parameters Xk, Yk, Ak and 
Rk {k — 1, ...,8) listed in Table [7] The X and Y coordinates 
are drawn independently from the uniform distribution (7(0, 200). 
Similarly, the amplitude A and size R of each object are drawn 
independently from the uniform distributions U{1, 2) and (7(3, 7) 
respectively. We multiply the amplitude of the first object by 10 to 
to see how senstive our nested sampling methods are to this order 
of magnitude difference in amplitudes. The simulated data map is 
created by adding independent Gaussian pixel noise with an rms of 



8.3 Defining the posterior distribution 

As discussed in HM03, in analysing the above simulated data map 
the Bayesian purist would attempt to infer simultaneously the full 
set of parameters = (A'obj, ai, 02, aiv^bj)- The crucial 
complication inherent to this approach is that the length of the pa- 
rameter vector is variable, since it depends on the unknown value 
A'obj . Thus any sampling based approach must be able to move be- 
tween spaces of different dimensionality, and such techniques are 
investigated in HM03. 

An alternative approach, also discussed by HM03, is simply 
to set A'obj ~ 1. In other words, the model for the data consists of 
just a single object and so the full parameter space under consider- 
ation is a = {X, Y, A, R}, wliich is fixed and only 4-dimensional. 
Although fixing A'obj = 1, it is important to understand that this 
does not restrict us to detecting just a single object in the data map. 
Indeed, by modelling the data in this way, we would expect the 
posterior distribution to possess numerous local maxima in the 4- 
dimensional parameter space, each corresponding to the location in 
this space of one of the objects present in the image. HM03 show 
this vastly simplified approach is indeed reliable when the objects 
of interest are spatially well-separated, and so for illustration we 
adopt this method here. 
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Figure 10. The toy model discussed in Sec. W2\ The 200 X 200 pixel test image (left panel) contains 8 Gaussian objects of varying widths and amplitudes; 
the parameters Xj., Yf., A/, and R)^ for each object are listed in Table|2] The right panel shows the corresponding data map with independent Gaussian noise 
added with an rms of 2 units. 



In this case, if the background 'noise' n is a statistically ho- 
mogeneous Gaussian random field with covariance matrix N = 
(nn'), then the likelihood function takes the form 

, _ exp {-1 [D - s{a)f N'^ [D - s{a)]} 

^^"^ (2vr)--/^|N|^''^ • ^''^ 

In our simple problem the background is just independent pixel 
noise, so N = a^I, where a is the noise rms. The prior on the 
parameters is assumed to be separable, so that 



7r(a) = ■R{X)TT{Y)Tv{A)n{R). 



(24) 



The priors on X and Y are taken to be the uniform distribution 
U (0, 200), whereas the priors on A and R are taken as the uniform 
distributions (7(1, 12.5) and (7(2, 9) respectively. 

The problem of object identification and characterization then 
reduces to sampling from the (unnormalised) posterior to infer pa- 
rameter values and calculating the 'local' Bayesian evidence for 
each detected object to assess the probability that it is indeed real. 
In the most straightforward approach, the two competing models 
between which we must select are Hq — 'the detected object is 
fake {A = 0)' and Hi = 'the detected object is real (A > 0)'. 
One could, of course, consider alternative definitions of these hy- 
potheses, such as setting Hq: A ^ Au^ and Hi: A > Au^, where 
Aiim is some (non-zero) cut-off value below which one is not inter- 
ested in the identified object. 



8.4 Results 

Since Bayesian object detection is of such interest, we analyse this 
problem using methods I, 2 and 3. For methods I and 2, do not use 
sub-clustering, since the posterior peaks are not expected to exhibit 
pronounced (curving) degeneracies. We use 400 live points with 
method 1 and 300 with methods 2 and 3. In methods I and 2, the 
initial enlargement factor was set to /o — 0.3. 

In Fig. [TT]we plot the live points, projected into the {X, Y)- 
subspace, at each successive likelihood level in the nested sampling 
algorithm (above an arbitrary base level) for each method. For the 





Method 1 


Method 2 


Method 3 




(no sub-clustering) 


(no sub-clustering) 




InZ 


-84765.63 


-84765.41 


-84765.45 


Error 


0.20 


0.24 


0.24 


Mike 


55,521 


74,668 


478,557 



Table 8. Summary of the global evidence estimates for toy model 3 and the 
number of likelihood evaluations required using different sampling meth- 
ods. The 'null' log-evidence for the model in which no object is present is 
-85219.44. 



methods 2 and 3 results, plotted in panel (b) and (c) respectively, the 
different colours denote points assigned to isolated clusters as the 
algorithm progresses; we note that the base likelihood level used in 
the figure was chosen to lie slightly below that at which the individ- 
ual clusters of points separate out. We see from the figure, that all 
three approaches have successfully sampled from this highly mul- 
timodal posterior distribution. As discussed in HM03, this repre- 
sents a very difficult problem for traditional MCMC methods, and 
illustrates the clear advantages of our methods. In detail, the figure 
shows that samples are concentrated in 8 main areas. Comparison 
with Fig.llOlshows that 7 of these regions do indeed correspond to 
the locations of the real objects (one being a combination of two 
real objects), whereas the remaining cluster corresponds to a 'con- 
spiration' of the background noise field. The CPU time required 
for Method 1 was only ~ 5 minutes on a single Itanium 2 (Madi- 
son) processor of the COSMOS supercomputer; each processor has 
a clock speed of 1.3 GHz, a 3Mb L3 cache and a peak performance 
of 5.2 Gflops. 

The global evidence results are summarised in Table [8] We 
see that all three approaches yield consistent values within the esti- 
mated uncertainties, which is very encouraging given their consid- 
erable algorithmic differences. 

We note, in particular, that Method 3 required more than 6 
times the number of likelihood evaluations as compared to the el- 
lipsoidal methods. This is to be expected given the non-degenerate 
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Cluster 


local In Z 


X 


Y 


A 


R 


1 


—84765.41 ± 0.24 


43.82 ± 0.05 


23.17 ± 0.05 


10.33 ± 0.15 


3.36 ± 0.03 


2 


—85219.61 ± 0.19 


100.10 ± 0.26 


40.55 ± 0.32 


1.93 ± 0.16 


2.88 ± 0.15 


3 


—85201.61 ± 0.21 


92.82 ± 0.14 


110.17 ± 0.16 


3.1/ ± 0.26 


2.42 ± 0.13 


4 


-85220.34 ±0.19 


182.33 ± 0.48 


85.85 ± 0.43 


1.11 ±0.07 


4.85 ± 0.30 


5 


-85194.16 ± 0.19 


33.96 ± 0.36 


161.50 ±0.35 


1.56 ±0.09 


6.28 ± 0.29 


6 


-85185.91 ± 0.19 


155.21 ±0.31 


169.76 ± 0.32 


1.69 ±0.09 


6.48 ± 0.24 


7 


-85216.31 ± 0.19 


154.87 ± 0.32 


31.59 ±0.22 


1.98 ±0.17 


3.16 ±0.20 


8 


-85223.57 ±0.21 


158.12 ±0.17 


96.17 ±0.19 


2.02 ±0.10 


2.15 ±0.09 



Table 9. The mean and standard deviation of the evidence and infen'ed object pai'ameters Xj. , Yfe , and for toy model 4 using Method 2. 



shape of the posterior modes and the low-dimensionality of this 
problem. The global evidence value of ~ —84765 may be inter- 
preted as corresponding to the model Hi — 'there is a real object 
somewhere in the image'. Comparing this with the 'null' evidence 
value ~ —85219 for Hq = 'there is no real object in the image', 
we see that Hi is strongly favoured, with a log-evidence difference 
of A in 2 ~ 454. 

In object detection, however, one is more interested in whether 
or not to believe the individual objects identified. As discussed in 
Sections 15.31 and \5A\ using Method 2 and Method 3, samples be- 
longing to each identified mode can be separated and local evi- 
dences and posterior inferences calculated. In Table|9l for each sep- 
arate cluster of points, we list the mean and standard error of the in- 
ferred object parameters and the local log-evidence obtained using 
method 2; similar results are obtained from Method 3. Considering 
first the local evidences and comparing them with the 'null' evi- 
dence of —85219.44, we see that all the identified clusters should 
be considered as real detections, except for cluster 8. Comparing 
the derived object parameters with the inputs listed in Table [T] we 
see that this conclusion is indeed correct. Moreover, for the 7 re- 
maining clusters, we see that the derived parameter values for each 
object are consisent with the true values. 

It is worth noting, however, that cluster 6 does in fact corre- 
spond to the real objects 6 and 8, as listed in Table|7] This occurs 
because object 8 lies very close to object 6, but has a much lower 
amplitude. Although one can see a separate peak in the posterior 
at the location of object 8 in Fig. ll If c) (indeed this is visible in all 
three panels). Method 2 was not able to identify a separate, isolated 
cluster for this object. Thus, one drawback of clustered ellipsoidal 
sampling method is that it may not identify all objects in a set lying 
very close together and with very different amplitudes. This prob- 
lem can be overcome by increasing the number of objects assumed 
in the model from A'^^bj = 1 to some appropriate larger value, but 
we shall not explore this further here. It should be noted however, 
that failure to separate out every real object has no impact on the 
accuracy of the estimated global evidence, since the algorithm still 
samples from a region that includes all the objects. 



9 DISCUSSION AND CONCLUSIONS 

In this paper, we have presented various me thods that allow the ap- 
plication of the nested sampling algorithm iSkilIing||2004h to gen- 
eral distributions, particular those with multiple modes and/or pro- 
nounced (curving) degeneracies. As a result, we have produced a 
general Monte Carlo technique capable of calculating Bayesian ev- 
idence values and producing posterior inferences in an efficient and 
robust manner. As such, our methods provide a viable alternative to 



MCMC techniques for performing Bayesian analyses of astronom- 
ical data sets. Moreover, in the analysis of a set of toy problems, we 
demonstrate that our methods are capable of sampling effectively 
from posterior distributions that have traditionally caused problems 
for MCMC approaches. Of particular interest is the excellent per- 
formance of our methods in Bayesian object detection and valida- 
tion, but our approaches should provide advantages in all areas of 
Bayesian astronomical data analysis. 

A critical analysis of Bayesian methods and MCMC sampling 
has recently been presented by Bryan et al. ( 2007), who advocate 
a frequentist approach to cosmological parameter estimation from 
the CMB power spectrum. While we refute wholeheartedly their 
criticisms of Bayesian methods per se, we do have sympathy with 
their assessment of MCMC metho ds as a po or means of performing 
a Bayesian inference. In particular. lBrvan e t al. ( 2007) note that for 
MCMC sampling methods "if a posterior is comprised by two nar- 
row, spatially separated Gaussians, then the probability of transition 
from one Gaussian to the other will be vanishingly small. Thus, af- 
ter the chain has rattled around in one of the peaks for a while, 
it will appear that the chain has converged; however, after some 
finite amount of time, the chain will suddenly jump to the other 
peak, revealing that the initial indications of convergence were in- 
correct." They also go on to point out that MCMC methods often 
require considerable tuning of the proposal distribution to sample 
efficiently, and that by their very nature MCMC samples are con- 
centrated at the peak(s) of the posterior distribution often leading to 
underestimation of confidence intervals when time allows only rel- 
atively few samples to be taken. We believe our multimodal nested 
sampling algorithms address all these critic isms. Perhaps of most 
relevance is the clai m bv .Bryan et alj j2007i) that their analysis of 
the 1-year WMAP teennett et al.ll2003l) identifies two distinct re- 
gions of high posterior probability in the cosmological parameter 
space. Such multimodality suggests that our methods will be ex- 
tremely useful in analysing WMAP data and we will investigate 
this in a forthcoming publication. 

The progress of our multimodel nested sampling algorithms 
based on ellipsoidal sampling (methods 1 and 2) is controlled by 
three main parameters: (i) the number of live points A'^; (ii) the ini- 
tial enlargement factor fo ; and (iii) the rate a at which the enlarge- 
ment factor decreases with decreasing prior volume. The approach 
based on Metropolis nested sampling (Method 3) depends only on 
A*'. These values can be chosen quite easily as outlined below and 
the performance of the algorithm is relatively insensitive to them. 
First, A'^ should be large enough that, in the initial sampling from 
the full prior space, there is a high probability that at least one point 
lies in the 'basin of attraction' of each mode of the posterior. In later 
iterations, live points will then tend to populate these modes. Thus, 
as a rule of thumb, one should take TV > V',r/'Knodc, where Vmoda 
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Figure 11. The set of live points, projected into the {X, y)-subspace, at 
each successive likelihood level in the nested sampling in the analysis of 
the data map in Fig JlOf right panel) using: (a) Method 1 (no sub-clustering); 
(b) Method 2 (no sub-clustering); and (c) Method 3. In (b) and (c) the dif- 
ferent colours denote points assigned to isolated clusters as the algorithm 
progresses. 



in l5.1.2l Third, a should be set in the range 0-1, but typically a 
value of Of ~ 0.2 is appropriate for most problems. The algorithm 
also depends on a few additional parameters, such as the number of 
previous iterations to consider when matching clusters in Method 
2 (see Section [53l ). and the number of points shared between sub- 
clusters when sampling from degeneracies (see Section |531 ). but 
there is generally no need to change them from their default values. 

Looking forward to the further development of our approach, 
we note that the new methods presented in this paper operate by 
providing an efficient means for performing the key step at each it- 
eration of a nested sampling process, namely drawing a point from 
the prior within the hard constraint that its likelihood is greater 
than that of the previous discarded point. In particular, we build 
on the ellipsoidal sam pling approaches previo usly suggested by 
iMukheriee eFZI j2006 'l and 'Shaw et alj jlOOH) . One might, how- 
ever, consider replacing each hard-edged ellipsoidal bound by some 
softer-edged smooth probability distribution. Such an approach 
would remove the potential (but extemely unlikely) problem that 
some part of the true iso-likelihood contour may lie outside the 
union of the ellipsoidal bounds, but it does bring additional com- 
plications. In particular, we explored the use of multivariate Gaus- 
sian distributions defined by the covariance matrix of the relevant 
live points, but found that the large tails of such distributions con- 
siderably reduced the sampling efficiency in higher-dimensional 
problems. The investigation of alternative distributions with heavier 
tails is ongoing. Another difficulty in using soft-edged distributions 
is that the method for sampling consistent from overlapping regions 
becomes considerably more complicated, and this too is currently 
under investigation. 

We intend to apply our new multimodal nested sampling meth- 
ods to a range of astrophysical data analysis problems in a number 
of forthcoming papers. Once we are satisfied that the code performs 
as anticipated in these test cases, we plan to make a Fortran library 
containing our routines publically available. Anyone wishing to use 
our code prior to the public release should contact the authors. 
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is (an estimate of) the volume of the posterior mode containing the 
smallest probability volume (of interest) and is the volume of 
the full prior space. It should be remembered, of course, that 
must always exceed the dimensionality D of the parameter space. 
Second, /o should usually be set in the range 0-0.5. At the ini- 
tial stages, a large value of fo is required to take into account the 
error in approximating a large prior volume with ellipsoids con- 
tracted from limited number of live points. Typically, a value of 
fo ~ 0.3 should suffice for A'^ ~ 300. The dynamic enlarge- 
ment factor fi,k gradually goes down with decreasing prior volume 
and consequently, increasing the sampling efficiency as discussed 
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